chore: Update to arrow/parquet 59.0.0#22744
| .bloom_filter_properties(&ColumnPath::from("")) | ||
| .expect("expected bloom properties!") | ||
| .fpp, | ||
| .fpp(), |
These fields are made private in
| }; | ||
| if let Some(bloom_filter_ndv) = bloom_filter_ndv { | ||
| builder = builder.set_bloom_filter_ndv(*bloom_filter_ndv); | ||
| builder = builder.set_bloom_filter_max_ndv(*bloom_filter_ndv); |
- Due to feat(parquet): add BloomFilterPropertiesBuilder arrow-rs#9877, which deprecated set_bloom_filter_ndv
| ndv: DEFAULT_BLOOM_FILTER_NDV | ||
| }), | ||
| Some( | ||
| &BloomFilterProperties::builder() |
Properties are now built with builder:
| &[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], | ||
| &[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 5, 6], | ||
| ]))], | ||
| vec![Arc::new( |
The From impl was removed because it could panic
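As a rough illustration of why a panicking `From` conversion tends to be replaced by a fallible one, here is a minimal sketch. `FixedWidthRows` is a hypothetical type invented for this example; it is not the arrow-rs array type touched by the diff, and the thread does not say which fallible API replaced the removed impl.

```rust
use std::convert::TryFrom;

/// Hypothetical fixed-width byte container (NOT an arrow-rs type).
#[derive(Debug, PartialEq)]
struct FixedWidthRows {
    width: usize,
    data: Vec<u8>,
}

impl<'a> TryFrom<Vec<&'a [u8]>> for FixedWidthRows {
    type Error = String;

    /// Returns an error instead of panicking when rows have mixed lengths.
    fn try_from(rows: Vec<&'a [u8]>) -> Result<Self, Self::Error> {
        let width = rows
            .first()
            .map(|r| r.len())
            .ok_or_else(|| "no rows, so the width is unknown".to_string())?;
        let mut data = Vec::with_capacity(width * rows.len());
        for (i, row) in rows.iter().enumerate() {
            if row.len() != width {
                return Err(format!(
                    "row {} has length {}, expected {}",
                    i,
                    row.len(),
                    width
                ));
            }
            data.extend_from_slice(row);
        }
        Ok(FixedWidthRows { width, data })
    }
}

fn main() {
    let a: &[u8] = &[0, 1];
    let b: &[u8] = &[2, 3];
    let short: &[u8] = &[4];

    // Equal-length rows convert successfully.
    println!("{:?}", FixedWidthRows::try_from(vec![a, b]));
    // A mismatched row is reported as an Err rather than a panic.
    println!("{:?}", FixedWidthRows::try_from(vec![a, short]));
}
```

With an infallible `From`, the mismatched-length case would have no way to report failure other than panicking, which is the hazard the comment refers to.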
| scale: 2, | ||
| precision: 9, | ||
| }) | ||
| .with_logical_type(LogicalType::decimal(2, 9)) |
have to use new helpers added in
| ] | ||
|
|
||
| [[package]] | ||
| name = "integer-encoding" |
yay for removing older deps
| "cfg-if", | ||
| ] | ||
|
|
||
| [[package]] |
no more thrift! We now use the entirely new thrift encoder and not the thrift generator
55c7eb3 to
e616241
Compare
| | alltypes_plain.parquet | 1851 | 8882 | 2 | page_index=false | | ||
| | alltypes_tiny_pages.parquet | 454233 | 269074 | 2 | page_index=true | | ||
| | lz4_raw_compressed_larger.parquet | 380836 | 1339 | 2 | page_index=false | | ||
| | alltypes_plain.parquet | 1851 | 8794 | 2 | page_index=false | |
I think this changed (smaller in-memory size) due to the representation change of CompressionCodec in this PR.
It changes from Compression, which also carries the compression level: ZSTD(ZstdLevel), GZIP(GzipLevel), BROTLI(BrotliLevel). ZstdLevel(i32), GzipLevel(u32), and BrotliLevel(u32) are 4-byte wrappers, so Compression = 4-byte discriminant + 4-byte level = 8 bytes.
To a fieldless enum, CompressionCodec, which is 1 byte.
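The size arithmetic above can be checked with a small standalone sketch. The types below are hypothetical stand-ins that mirror the shapes described in the comment (4-byte level wrappers, an enum whose variants carry a level, and a fieldless enum); they are not the parquet crate's definitions.

```rust
use std::mem::size_of;

// Hypothetical stand-ins for the level wrappers: each is a 4-byte newtype.
#[allow(dead_code)]
struct ZstdLevel(i32);
#[allow(dead_code)]
struct GzipLevel(u32);
#[allow(dead_code)]
struct BrotliLevel(u32);

// Codec plus level: some variants carry a 4-byte payload, so the enum
// needs room for the payload and for a discriminant.
#[allow(dead_code)]
enum CompressionWithLevel {
    Uncompressed,
    Snappy,
    Gzip(GzipLevel),
    Brotli(BrotliLevel),
    Zstd(ZstdLevel),
}

// Codec only: a fieldless enum with a handful of variants.
#[allow(dead_code)]
enum CodecOnly {
    Uncompressed,
    Snappy,
    Gzip,
    Brotli,
    Zstd,
}

fn main() {
    println!("with level: {} bytes", size_of::<CompressionWithLevel>());
    println!("codec only: {} bytes", size_of::<CodecOnly>());
}
```

On current rustc this reports 8 bytes and 1 byte respectively, matching the estimate in the comment; Rust's default enum layout is not formally specified, so treat the numbers as an observation rather than a guarantee.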
| Total Requests: 2 | ||
| - GET (opts) path=parquet_table.parquet head=true | ||
| - GET (ranges) path=parquet_table.parquet ranges=1064-1481,1481-1594,1594-2011,2011-2124 | ||
| - GET (ranges) path=parquet_table.parquet ranges=1064-1594,1594-2124 |
this seems like an improvement -- contiguous ranges are coalesced into fewer ranges. I tracked it down to this PR from @HippoBaro
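A minimal sketch of coalescing contiguous byte ranges follows. It is not the upstream implementation: the `coalesce` helper is hypothetical, the ranges are treated as half-open, and the grouping into two sets is an assumption made only so the output matches the two ranges in the log above (the thread does not say why the four ranges become two rather than one).

```rust
use std::ops::Range;

/// Merge ranges that touch or overlap. Assumes `group` is sorted by start.
fn coalesce(group: &[Range<u64>]) -> Vec<Range<u64>> {
    let mut out: Vec<Range<u64>> = Vec::new();
    for r in group {
        if let Some(last) = out.last_mut() {
            if r.start <= last.end {
                // Contiguous or overlapping: extend the previous range.
                last.end = last.end.max(r.end);
                continue;
            }
        }
        out.push(r.clone());
    }
    out
}

fn main() {
    // Assumed grouping (for illustration only): ranges are merged within
    // each group, never across groups.
    let groups: Vec<Vec<Range<u64>>> = vec![
        vec![1064..1481, 1481..1594],
        vec![1594..2011, 2011..2124],
    ];
    let merged: Vec<Range<u64>> = groups.iter().flat_map(|g| coalesce(g)).collect();
    println!("{:?}", merged); // prints [1064..1594, 1594..2124]
}
```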
e616241 to
4d274ef
Compare
Performance looks about the same
Excited to see this land now that the arrow release is out!
4d274ef to
64976e9
Compare
run benchmarks
I just completed the arrow release and rebased this PR to use the released arrow
🤖 Benchmark running (GKE): comparing alamb/update_arrow_59 (64976e9) to 8995ce6 (merge-base) using clickbench_partitioned
| ] | ||
|
|
||
| [[package]] | ||
| name = "thrift" |
no more thrift dependency!
🤖 Benchmark running (GKE): comparing alamb/update_arrow_59 (64976e9) to 8995ce6 (merge-base) using tpch
🤖 Benchmark running (GKE): comparing alamb/update_arrow_59 (64976e9) to 8995ce6 (merge-base) using tpcds
🤖 Benchmark completed (GKE): tpch, base (merge-base) vs. branch
🤖 Benchmark completed (GKE): tpcds, base (merge-base) vs. branch
🤖 Benchmark completed (GKE): clickbench_partitioned, base (merge-base) vs. branch
zhuqi-lucas
left a comment
LGTM thanks @alamb !
Thank you @adriangb and @zhuqi-lucas
## Which issue does this PR close?
- related to apache/arrow-rs#9110
## Rationale for this change
Update to latest version of arrow/parquet
## What changes are included in this PR?
1. Update to arrow/parquet 59.0.0
2. Adjust code for API differences
## Are these changes tested?
By CI
## Are there any user-facing changes?
New dependency
…+ into_builder pruning

Replaces the upfront 'build one ParquetPushDecoder per run' design with a single live decoder driven via try_next_reader, plus a lightweight VecDeque<RgPlanEntry> tracking remaining row groups. At each RG boundary the runtime pruner peeks the next RG's stats against the current dynamic filter; if it can be pruned, we drop the head of the plan and rebuild the decoder via ParquetPushDecoder::into_builder (arrow-rs 59.0.0 via apache#22744) with a tightened with_row_groups list. Per adriangb's suggestion on apache#22450.

Wins over the previous multi-decoder design:
- No wasted decoder construction for pruned runs (the old code eagerly built N decoders, then discarded the ones the pruner rejected).
- Per-RG pruning granularity instead of per-run (the old code could only skip whole runs; consecutive prunable RGs split across runs paid the boundary tax).
- Single mid-flight decoder reuses buffered bytes across into_builder rebuilds (arrow-rs preserves them).

Critical correctness fix during development: the rg_plan must use prepared_access_plan.row_group_indexes (the POST-reorder list emitted by ParquetAccessPlan::prepare → reorder_by_statistics + reverse), not the natural-order access_plan.row_group_indexes() captured before prepare. The reorder runs whenever sort_order_for_reorder is set; if the pruner consults RG metadata in natural order while the decoder yields readers in stats-optimal order, it checks the wrong stats and either drops live rows or rebuilds a decoder over the wrong RGs. Regression covered by the new dynamic_rg_pruning_handles_sort_pushdown_reorder integration test: that test was written to fail under the pre-fix code and pass under the fix, double-verified by temporarily reintroducing the buggy ordering during review.

Trade-off accepted, with follow-up: the OLD design used split_runs to build separate decoders for fully-matched vs needs-filter runs, so fully-matched RGs paid zero row-filter eval. With one shared decoder the row_filter stays installed across all RGs, and arrow-rs has no public API to clear it once set (only with_row_filter setter exists). Net impact is small: decode (the dominant per-row cost) still runs on fully-matched RGs anyway; only the row_filter.evaluate call is wasted. Restoring the optimization cleanly requires either an upstream ParquetPushDecoderBuilder::without_row_filter, or re-introducing the per-section decoder split inside the new state machine. Tracked as follow-up.

Net diff: -107 lines. Dead code from the multi-decoder path (ParquetAccessPlan::split_runs + RowGroupRun + has_fully_matched + RowFilterGenerator::has_row_filter + four split_runs unit tests) is removed.

All affected tests pass: 142 datasource-parquet lib tests + 206 parquet_integration (including the new regression test) + SLT sweep across dynamic_row_group_pruning, sort_pushdown, limit_pruning, push_down_filter_regression, clickbench, projection_pushdown, push_down_filter_parquet, dynamic_filter_pushdown_config, preserve_file_partitioning, limit.
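The head-of-plan pruning described in this commit message can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the DataFusion code: `RgPlanEntry` is named in the message, but its fields here are hypothetical, the "dynamic filter" is reduced to a single `value > threshold` bound, and `prune_head` is an invented helper. Decoder rebuilding via into_builder is omitted entirely.

```rust
use std::collections::VecDeque;

/// Hypothetical plan entry: a row-group index plus the max statistic of
/// the filtered column (field names are assumptions for this sketch).
#[derive(Debug, Clone, PartialEq)]
struct RgPlanEntry {
    row_group: usize,
    max_value: i64,
}

/// Drop entries from the head of the plan while their statistics prove
/// that no row can satisfy `value > threshold`. Returns the pruned
/// row-group indexes; the first surviving entry stays at the head.
fn prune_head(plan: &mut VecDeque<RgPlanEntry>, threshold: i64) -> Vec<usize> {
    let mut pruned = Vec::new();
    while let Some(head) = plan.front() {
        if head.max_value > threshold {
            break; // stats cannot rule this row group out
        }
        pruned.push(head.row_group);
        plan.pop_front();
    }
    pruned
}

fn main() {
    // Entries are listed in the order the decoder will yield row groups
    // (after any statistics-based reorder), not in file order.
    let mut plan: VecDeque<RgPlanEntry> = VecDeque::from(vec![
        RgPlanEntry { row_group: 2, max_value: 100 },
        RgPlanEntry { row_group: 0, max_value: 40 },
        RgPlanEntry { row_group: 1, max_value: 30 },
    ]);

    // Loose filter at first: nothing can be pruned.
    println!("pruned: {:?}", prune_head(&mut plan, 10));

    // Row group 2 is decoded, then the dynamic filter tightens.
    plan.pop_front();
    println!("pruned: {:?}", prune_head(&mut plan, 50));
    println!("remaining: {}", plan.len());
}
```

The ordering comment in `main` reflects the correctness fix called out above: if the plan were kept in file order while the decoder yields row groups in reordered order, the head entry's statistics would describe a different row group than the one about to be read.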
Which issue does this PR close?
- Related to 59.0.0 (May 2026) arrow-rs#9110
Rationale for this change
Update to latest version of arrow/parquet
What changes are included in this PR?
1. Update to arrow/parquet 59.0.0
2. Adjust code for API differences
Are these changes tested?
By CI
Are there any user-facing changes?
New dependency